In [46]:
#!jupyter nbconvert --to html --allow-chromium-download AdmissionRateAnalysis.ipynb
[NbConvertApp] Converting notebook AdmissionRateAnalysis.ipynb to html
[NbConvertApp] Writing 1101862 bytes to AdmissionRateAnalysis.html

College admission: https://www.kaggle.com/datasets/sumithbhongale/american-university-data-ipeds-dataset

Research ideas:

  • what social, economic, and regional factors affect admission rate at undergraduate institutions?
  • do more expensive colleges actually breed more successful students? - could not do because didn't really have data on success of students

Opportunities for error / misinterpretation:

  • each state / region has a lot of colleges with different acceptance rates
  • non-normalized number of observations per category, may skew data
  • non linear relationship between tuition and admission rate, couldn't find a fix

Recommendations:

  • collect % in state vs out state students
  • % first-generation students (technically had data, was just missing)
  • avg income of parents/caregivers
  • avg test scores

Read in data¶

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
import math
import numpy as np
In [3]:
# read in data
df = pd.read_excel("./IPEDS_data.xlsx")
In [4]:
# list all cols
list(df.columns)
Out[4]:
['ID number',
 'Name',
 'year',
 'ZIP code',
 'Highest degree offered',
 'County name',
 'Longitude location of institution',
 'Latitude location of institution',
 'Religious affiliation',
 'Offers Less than one year certificate',
 'Offers One but less than two years certificate',
 "Offers Associate's degree",
 'Offers Two but less than 4 years certificate',
 "Offers Bachelor's degree",
 'Offers Postbaccalaureate certificate',
 "Offers Master's degree",
 "Offers Post-master's certificate",
 "Offers Doctor's degree - research/scholarship",
 "Offers Doctor's degree - professional practice",
 "Offers Doctor's degree - other",
 'Offers Other degree',
 'Applicants total',
 'Admissions total',
 'Enrolled total',
 'Percent of freshmen submitting SAT scores',
 'Percent of freshmen submitting ACT scores',
 'SAT Critical Reading 25th percentile score',
 'SAT Critical Reading 75th percentile score',
 'SAT Math 25th percentile score',
 'SAT Math 75th percentile score',
 'SAT Writing 25th percentile score',
 'SAT Writing 75th percentile score',
 'ACT Composite 25th percentile score',
 'ACT Composite 75th percentile score',
 'Estimated enrollment, total',
 'Estimated enrollment, full time',
 'Estimated enrollment, part time',
 'Estimated undergraduate enrollment, total',
 'Estimated undergraduate enrollment, full time',
 'Estimated undergraduate enrollment, part time',
 'Estimated freshman undergraduate enrollment, total',
 'Estimated freshman enrollment, full time',
 'Estimated freshman enrollment, part time',
 'Estimated graduate enrollment, total',
 'Estimated graduate enrollment, full time',
 'Estimated graduate enrollment, part time',
 "Associate's degrees awarded",
 "Bachelor's degrees awarded",
 "Master's degrees awarded",
 "Doctor's degrese - research/scholarship awarded",
 "Doctor's degrees - professional practice awarded",
 "Doctor's degrees - other awarded",
 'Certificates of less than 1-year awarded',
 'Certificates of 1 but less than 2-years awarded',
 'Certificates of 2 but less than 4-years awarded',
 'Postbaccalaureate certificates awarded',
 "Post-master's certificates awarded",
 "Number of students receiving an Associate's degree",
 "Number of students receiving a Bachelor's degree",
 "Number of students receiving a Master's degree",
 "Number of students receiving a Doctor's degree",
 'Number of students receiving a certificate of less than 1-year',
 'Number of students receiving a certificate of 1 but less than 4-years',
 "Number of students receiving a Postbaccalaureate or Post-master's certificate",
 'Percent admitted - total',
 'Admissions yield - total',
 'Tuition and fees, 2010-11',
 'Tuition and fees, 2011-12',
 'Tuition and fees, 2012-13',
 'Tuition and fees, 2013-14',
 'Total price for in-state students living on campus 2013-14',
 'Total price for out-of-state students living on campus 2013-14',
 'State abbreviation',
 'FIPS state code',
 'Geographic region',
 'Sector of institution',
 'Level of institution',
 'Control of institution',
 'Historically Black College or University',
 'Tribal college',
 'Degree of urbanization (Urban-centric locale)',
 'Carnegie Classification 2010: Basic',
 'Total  enrollment',
 'Full-time enrollment',
 'Part-time enrollment',
 'Undergraduate enrollment',
 'Graduate enrollment',
 'Full-time undergraduate enrollment',
 'Part-time undergraduate enrollment',
 'Percent of total enrollment that are American Indian or Alaska Native',
 'Percent of total enrollment that are Asian',
 'Percent of total enrollment that are Black or African American',
 'Percent of total enrollment that are Hispanic/Latino',
 'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
 'Percent of total enrollment that are White',
 'Percent of total enrollment that are two or more races',
 'Percent of total enrollment that are Race/ethnicity unknown',
 'Percent of total enrollment that are Nonresident Alien',
 'Percent of total enrollment that are Asian/Native Hawaiian/Pacific Islander',
 'Percent of total enrollment that are women',
 'Percent of undergraduate enrollment that are American Indian or Alaska Native',
 'Percent of undergraduate enrollment that are Asian',
 'Percent of undergraduate enrollment that are Black or African American',
 'Percent of undergraduate enrollment that are Hispanic/Latino',
 'Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander',
 'Percent of undergraduate enrollment that are White',
 'Percent of undergraduate enrollment that are two or more races',
 'Percent of undergraduate enrollment that are Race/ethnicity unknown',
 'Percent of undergraduate enrollment that are Nonresident Alien',
 'Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander',
 'Percent of undergraduate enrollment that are women',
 'Percent of graduate enrollment that are American Indian or Alaska Native',
 'Percent of graduate enrollment that are Asian',
 'Percent of graduate enrollment that are Black or African American',
 'Percent of graduate enrollment that are Hispanic/Latino',
 'Percent of graduate enrollment that are Native Hawaiian or Other Pacific Islander',
 'Percent of graduate enrollment that are White',
 'Percent of graduate enrollment that are two or more races',
 'Percent of graduate enrollment that are Race/ethnicity unknown',
 'Percent of graduate enrollment that are Nonresident Alien',
 'Percent of graduate enrollment that are Asian/Native Hawaiian/Pacific Islander',
 'Percent of graduate enrollment that are women',
 'Number of first-time undergraduates - in-state',
 'Percent of first-time undergraduates - in-state',
 'Number of first-time undergraduates - out-of-state',
 'Percent of first-time undergraduates - out-of-state',
 'Number of first-time undergraduates - foreign countries',
 'Percent of first-time undergraduates - foreign countries',
 'Number of first-time undergraduates - residence unknown',
 'Percent of first-time undergraduates - residence unknown',
 'Graduation rate - Bachelor degree within 4 years, total',
 'Graduation rate - Bachelor degree within 5 years, total',
 'Graduation rate - Bachelor degree within 6 years, total',
 'Percent of freshmen receiving any financial aid',
 'Percent of freshmen receiving federal, state, local or institutional grant aid',
 'Percent of freshmen  receiving federal grant aid',
 'Percent of freshmen receiving Pell grants',
 'Percent of freshmen receiving other federal grant aid',
 'Percent of freshmen receiving state/local grant aid',
 'Percent of freshmen receiving institutional grant aid',
 'Percent of freshmen receiving student loan aid',
 'Percent of freshmen receiving federal student loans',
 'Percent of freshmen receiving other loan aid',
 'Endowment assets (year end) per FTE enrollment (GASB)',
 'Endowment assets (year end) per FTE enrollment (FASB)']

Data cleaning¶

In [5]:
# explore diff columns and their values
df.Name.value_counts()
Out[5]:
Westminster College                              3
Union College                                    3
Columbia College                                 2
University of St Thomas                          2
Bethel University                                2
                                                ..
University of Maryland-College Park              1
University of Maryland-Baltimore County          1
University of Maryland-University College        1
Loyola University Maryland                       1
Polytechnic University of Puerto Rico-Orlando    1
Name: Name, Length: 1517, dtype: int64
In [6]:
# why are there multiple entries of the same college?
# oh they just have the same name, but they're in different locations
df[df.Name == "Bethel University"]
Out[6]:
ID number Name year ZIP code Highest degree offered County name Longitude location of institution Latitude location of institution Religious affiliation Offers Less than one year certificate ... Percent of freshmen receiving federal grant aid Percent of freshmen receiving Pell grants Percent of freshmen receiving other federal grant aid Percent of freshmen receiving state/local grant aid Percent of freshmen receiving institutional grant aid Percent of freshmen receiving student loan aid Percent of freshmen receiving federal student loans Percent of freshmen receiving other loan aid Endowment assets (year end) per FTE enrollment (GASB) Endowment assets (year end) per FTE enrollment (FASB)
624 173160 Bethel University 2013 55112 Doctor's degree - research/scholarship Ramsey County -93.159295 45.057091 Baptist Yes ... 28.0 25.0 6.0 30.0 100.0 70.0 69.0 18.0 NaN 6617.0
1225 219718 Bethel University 2013 38201 Master's degree Carroll County -88.516162 36.139271 Cumberland Presbyterian Implied no ... 76.0 76.0 29.0 32.0 45.0 84.0 84.0 0.0 NaN 518.0

2 rows × 145 columns

In [7]:
# should we just study undergrad data?
df["Offers Bachelor's degree"].value_counts()
Out[7]:
Yes           1522
Implied no      10
Name: Offers Bachelor's degree, dtype: int64
In [8]:
# how many colleges offer bachelors degree = 1522 out of 1534
# there seem to be enough -- drop non-undergrad institutions
df = df[df["Offers Bachelor's degree"] == "Yes"]
In [9]:
# drop na in response (admission rate column)
df = df[~df["Percent admitted - total"].isna()]
In [10]:
# add region column
west = ["Washington","Oregon","California","Alaska","Idaho","Montana",
        "Wyoming","Nevada","Utah","Colorado","Arizona","New Mexico",
       "Hawaii"]
midwest = ["North Dakota", "South Dakota", "Nebraska","Kansas",
          "Minnesota","Iowa","Missouri","Wisconsin","Illinois",
          "Michigan","Indiana","Ohio"]
northeast = ["Maine","Vermont","New Hampshire","Massachusetts",
            "Connecticut","Rhode Island","New York","Pennsylvania",
            "New Jersey"]
south = ["Texas","Oklahoma","Arkansas","Louisiana","Mississippi",
        "Kentucky","Tennessee","Alabama","West Virginia","Virginia",
        "North Carolina","South Carolina","Georgia","Florida",
        "District of Columbia","Delaware","Maryland"]

def region(row):
    if row in west:
        return "West"
    elif row in midwest:
        return "Midwest"
    elif row in northeast:
        return "Northeast"
    else:
        return "South"
    
df["Region"] = df["State abbreviation"].apply(region)
In [11]:
# fix bool columns - turn into dummy variables for use in lin reg model
# edit: didn't end up using lin reg model
df[["Religious affiliation","Historically Black College or University"]]

def religious(row):
    if row == "Not applicable":
        return 0
    else:
        return 1
def hbcu(row):
    if row == "Yes":
        return 1
    else:
        return 0
def priv(row):
    if row == "Private not-for-profit":
        return 1
    else:
        return 0
    
df["Religious affiliation"] = df["Religious affiliation"].apply(religious)
df["Historically Black College or University"] = df["Historically Black College or University"].apply(hbcu)
df["Private"] = df["Control of institution"].apply(priv)

df[["Religious affiliation","Historically Black College or University","Private"]]
Out[11]:
Religious affiliation Historically Black College or University Private
0 0 1 0
1 0 0 0
3 0 0 0
4 0 1 0
5 0 0 0
... ... ... ...
1516 1 0 1
1525 0 0 0
1529 0 0 0
1530 1 0 1
1532 1 0 1

1376 rows × 3 columns

In [12]:
# add groupings of admission rates - by tens
def nearest_ten(row):
    a = row / 10
    a = math.floor(a)
    return a * 10
df["Admission rate - tens"] = df["Percent admitted - total"].apply(nearest_ten)
df[["Admission rate - tens","Percent admitted - total"]]
Out[12]:
Admission rate - tens Percent admitted - total
0 90 90.0
1 80 87.0
3 80 81.0
4 50 51.0
5 50 57.0
... ... ...
1516 60 60.0
1525 40 44.0
1529 30 35.0
1530 70 71.0
1532 50 53.0

1376 rows × 2 columns

In [13]:
# add groupings for urbanization, clean up category names
print(df["Degree of urbanization (Urban-centric locale)"].value_counts())

def urban(row):
    return row.split(":")[0]

df["Degree of urbanization"] = df["Degree of urbanization (Urban-centric locale)"].apply(urban)

print(df["Degree of urbanization"].value_counts())
City: Large        268
Suburb: Large      259
City: Small        199
City: Midsize      167
Town: Distant      148
Town: Remote       116
Town: Fringe        56
Rural: Fringe       47
Suburb: Midsize     47
Suburb: Small       31
Rural: Distant      25
Rural: Remote       13
Name: Degree of urbanization (Urban-centric locale), dtype: int64
City      634
Suburb    337
Town      320
Rural      85
Name: Degree of urbanization, dtype: int64
In [14]:
# filter to cols to keep

# maintain list of cols to keep (drop the rest later)
cols_i_want = ['Name','Longitude location of institution',
       'Latitude location of institution', 'Region', 'Religious affiliation',
       'Applicants total', 'Admissions total', 'Enrolled total',
       'Tuition and fees, 2010-11',
       'Tuition and fees, 2011-12', 'Tuition and fees, 2012-13',
       'Tuition and fees, 2013-14',
       'Total price for in-state students living on campus 2013-14',
       'Total price for out-of-state students living on campus 2013-14',
       'Historically Black College or University',
       'Undergraduate enrollment',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Percent of undergraduate enrollment that are Asian',
       'Percent of undergraduate enrollment that are Black or African American',
       'Percent of undergraduate enrollment that are Hispanic/Latino',
       'Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of undergraduate enrollment that are White',
       'Percent of undergraduate enrollment that are two or more races',
       'Percent of undergraduate enrollment that are Race/ethnicity unknown',
       'Percent of undergraduate enrollment that are Nonresident Alien',
       'Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander',
       'Percent of undergraduate enrollment that are women',
       'Graduation rate - Bachelor degree within 4 years, total',
       'Graduation rate - Bachelor degree within 5 years, total',
       'Graduation rate - Bachelor degree within 6 years, total',
       'Percent of freshmen receiving any financial aid',
       'Percent admitted - total',
        "Admission rate - tens",
        "Private",
        "Degree of urbanization",
        "State abbreviation"]

df = df[cols_i_want]
In [15]:
# look at NA's
df.isna().sum(axis=0)
Out[15]:
Name                                                                                       0
Longitude location of institution                                                          0
Latitude location of institution                                                           0
Region                                                                                     0
Religious affiliation                                                                      0
Applicants total                                                                           0
Admissions total                                                                           0
Enrolled total                                                                             0
Tuition and fees, 2010-11                                                                  4
Tuition and fees, 2011-12                                                                  4
Tuition and fees, 2012-13                                                                  3
Tuition and fees, 2013-14                                                                  0
Total price for in-state students living on campus 2013-14                                50
Total price for out-of-state students living on campus 2013-14                            50
Historically Black College or University                                                   0
Undergraduate enrollment                                                                   0
Percent of undergraduate enrollment that are American Indian or Alaska Native              0
Percent of undergraduate enrollment that are Asian                                         0
Percent of undergraduate enrollment that are Black or African American                     0
Percent of undergraduate enrollment that are Hispanic/Latino                               0
Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander     0
Percent of undergraduate enrollment that are White                                         0
Percent of undergraduate enrollment that are two or more races                             0
Percent of undergraduate enrollment that are Race/ethnicity unknown                        0
Percent of undergraduate enrollment that are Nonresident Alien                             0
Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander        0
Percent of undergraduate enrollment that are women                                         0
Graduation rate - Bachelor degree within 4 years, total                                    9
Graduation rate - Bachelor degree within 5 years, total                                    9
Graduation rate - Bachelor degree within 6 years, total                                    9
Percent of freshmen receiving any financial aid                                            3
Percent admitted - total                                                                   0
Admission rate - tens                                                                      0
Private                                                                                    0
Degree of urbanization                                                                     0
State abbreviation                                                                         0
dtype: int64
In [16]:
# interpolate with mean
df[["Graduation rate - Bachelor degree within 4 years, total",
   "Percent of freshmen receiving any financial aid"]].describe()
Out[16]:
Graduation rate - Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid
count 1367.000000 1373.000000
mean 39.057791 90.779315
std 21.494738 11.574645
min 0.000000 41.000000
25% 22.000000 87.000000
50% 36.000000 95.000000
75% 54.000000 99.000000
max 100.000000 100.000000
In [17]:
df["Graduation rate - Bachelor degree within 4 years, total"] = df["Graduation rate - Bachelor degree within 4 years, total"].fillna(39.057791)
df["Percent of freshmen receiving any financial aid"] = df["Percent of freshmen receiving any financial aid"].fillna(90.779315)

EDA¶

In [18]:
# save categorical vars to use for plots
categ_cols = ["Name","Region","Religious affiliation","Private","Historically Black College or University",
             "Degree of urbanization", "State abbreviation"]

Dive deeper into admission rate column

In [19]:
df["Percent admitted - total"].describe()
Out[19]:
count    1376.000000
mean       64.569767
std        18.710062
min         6.000000
25%        54.000000
50%        67.000000
75%        78.000000
max       100.000000
Name: Percent admitted - total, dtype: float64
In [20]:
# what colleges have the lowest AR?
df[df["Percent admitted - total"] == 6].Name
Out[20]:
544      Harvard University
1469    Stanford University
Name: Name, dtype: object
In [21]:
# what colleges have the highest AR?
df[df["Percent admitted - total"] == 100].Name
Out[21]:
15              University of West Alabama
95                      Coleman University
281              Lewis-Clark State College
283         Brigham Young University-Idaho
451                University of Pikeville
633          Metropolitan State University
709      Montana State University-Northern
799          College of Staten Island CUNY
1313    The University of Texas at El Paso
1343                       Goddard College
1385          Southern Virginia University
1479                   Brandman University
Name: Name, dtype: object
In [22]:
# hist of AR
sns.histplot(data=df, x="Percent admitted - total", color="#95b8d1")
plt.title("Distribution of Admission Rates")
Out[22]:
Text(0.5, 1.0, 'Distribution of Admission Rates')

Geographic distribution of colleges in data

In [23]:
# distribution of undergrad universities across U.S.
df.plot(kind="scatter", x="Longitude location of institution", y="Latitude location of institution", s=3)
plt.title("Locations of U.S. Colleges")
plt.show()

# chosen by https://www.50states.com/city/regions.html
sns.scatterplot(x="Longitude location of institution", y="Latitude location of institution",
                data=df, hue="Region")
plt.title("Locations of U.S. Colleges by Region")
plt.show()
plt.show()
In [24]:
# % colleges from each region
df["Region"].value_counts()
print("south:", df["Region"].value_counts()["South"]/len(df))
print("northeast:", df["Region"].value_counts()["Northeast"]/len(df))
print("midwest:", df["Region"].value_counts()["Midwest"]/len(df))
print("west:", df["Region"].value_counts()["West"]/len(df))
south: 0.34084302325581395
northeast: 0.27325581395348836
midwest: 0.2558139534883721
west: 0.1300872093023256
In [25]:
# show AR across geography
sns.relplot(x="Longitude location of institution", y="Latitude location of institution", hue="Admission rate - tens",
            size="Admission rate - tens",
            sizes=(50, 300), palette="mako",
            height=6, edgecolor="none",data=df.sort_values(["Admission rate - tens"], ascending=False))
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x7fd937c452b0>

How does average AR vary across different college types / breakdowns?

In [26]:
for col in categ_cols:
    if col == "Name":
        continue
    sns.barplot(data=df, x=col, y="Percent admitted - total")
    t = "Average Admission Rate by " + col
    if "State" in col:
        plt.xticks(rotation = 90)
    plt.title(t)
    plt.show()

How is each category distributed in the data? And how does that look across different AR's?

In [27]:
for col in categ_cols:
    if col == "Name":
        continue
    sns.countplot(x=col, data=df)
    t = "Counts of " + col + " in data"
    plt.title(t)
    plt.show()
    sns.countplot(x='Admission rate - tens', data=df, hue=col)
    t = "Frequency of Various AR's by " + col
    if "State" in col:
        plt.xticks(rotation = 90)
    plt.title(t)
    plt.show()

for fun, compare to OSU

In [28]:
df_osu=df[(df["Name"] == "Ohio State University-Main Campus") | (df["Name"] == "University of Michigan-Ann Arbor")]

sns.barplot(data=df_osu, x="Name", y="Percent admitted - total",palette=["#00274C","#D10000"])
plt.xticks(rotation=10)
Out[28]:
(array([0, 1]),
 [Text(0, 0, 'University of Michigan-Ann Arbor'),
  Text(1, 0, 'Ohio State University-Main Campus')])

Explore trends with tuition

In [29]:
df[['Tuition and fees, 2010-11', 'Tuition and fees, 2011-12',
       'Tuition and fees, 2012-13', 'Tuition and fees, 2013-14']].describe()
Out[29]:
Tuition and fees, 2010-11 Tuition and fees, 2011-12 Tuition and fees, 2012-13 Tuition and fees, 2013-14
count 1372.000000 1372.000000 1373.000000 1376.000000
mean 19100.625364 20032.930758 20870.368536 21604.350291
std 11190.473914 11585.882264 12017.088416 12483.377722
min 910.000000 910.000000 3770.000000 3850.000000
25% 7861.750000 8360.750000 8710.000000 8975.000000
50% 19617.500000 20495.000000 21273.000000 22225.000000
75% 27346.250000 28662.750000 29840.000000 30897.750000
max 43990.000000 45290.000000 47246.000000 49138.000000
In [30]:
df_tuition_over_time = pd.DataFrame({"Year":[2010,2011,2012,2013],
                                    "Tuition":[19100.625364,20032.930758,20870.368536,21604.350291]})
sns.lineplot(data=df_tuition_over_time, x="Year", y="Tuition")
plt.title("Average Tuition over Time")
Out[30]:
Text(0.5, 1.0, 'Average Tuition over Time')
In [31]:
for col in categ_cols:  
    if col == "Name":
        continue
    tuition_cols = ["Tuition and fees, 2010-11","Tuition and fees, 2011-12",
                                                   "Tuition and fees, 2012-13","Tuition and fees, 2013-14"]
    temp_df = df.groupby([col]).mean(numeric_only=True)[tuition_cols]
    breakdowns = temp_df.index
    tuition_over_years = []

    for r in breakdowns:
        for c in tuition_cols:
            tuition_over_years.append(temp_df.loc[r,c])

    new_df = pd.DataFrame({col:list(np.repeat([breakdowns],4)),
                 "Year":[2010,2011,2012,2013] * len(breakdowns),
                 "Tuition":tuition_over_years})
    sns.lineplot(data=new_df, x="Year", y="Tuition",hue=col)
    t="Tuition from 2010-2013 by " + col
    plt.title(t)
    plt.show()

Trends with AR and race

In [32]:
demog_cols = ['Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Percent of undergraduate enrollment that are Asian',
       'Percent of undergraduate enrollment that are Black or African American',
       'Percent of undergraduate enrollment that are Hispanic/Latino',
       'Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of undergraduate enrollment that are White',
       'Percent of undergraduate enrollment that are two or more races',
       'Percent of undergraduate enrollment that are Race/ethnicity unknown',
       'Percent of undergraduate enrollment that are Nonresident Alien',
       'Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander',
       'Percent of undergraduate enrollment that are women']
df_demog = df.groupby(["Admission rate - tens"])[demog_cols].mean(numeric_only=True)
df_demog["Admission rate - tens"] = [0,10,20,30,40,50,60,70,80,90,100]

for col in demog_cols:
    sns.barplot(data=df_demog, x="Admission rate - tens",y=col)
    plt.ylabel("")
    t = col + "\nacross Admission Rates"
    plt.title(t)
    plt.show()
In [33]:
df_race = df[["Percent admitted - total","Percent of undergraduate enrollment that are Asian",
   "Percent of undergraduate enrollment that are Hispanic/Latino","Percent of undergraduate enrollment that are White",
   "Percent of undergraduate enrollment that are Black or African American"]]

adm_rate = [100,90,80,70,60,50,40,30,20,10]
race_list = ["Asian","Hispanic/Latino","White","Black"] * len(adm_rate)

df_race_ar = pd.DataFrame({"Race":race_list})
pct_enrl = []
for ar in adm_rate:
    temp_df = df_race[df_race["Percent admitted - total"] < ar].describe()
    for i in range(1,5):
        pct_enrl.append(temp_df.iloc[1,i])
    
df_race_ar["Percent enrollment"] = pct_enrl
df_race_ar["Admission Rate"] = list(np.repeat(adm_rate,4))

df_race_ar
    
Out[33]:
Race Percent enrollment Admission Rate
0 Asian 4.107771 100
1 Hispanic/Latino 8.958211 100
2 White 60.999267 100
3 Black 13.037390 100
4 Asian 4.266929 90
5 Hispanic/Latino 8.965354 90
6 White 60.770079 90
7 Black 13.063780 90
8 Asian 4.481481 80
9 Hispanic/Latino 9.212963 80
10 White 59.436111 80
11 Black 13.723148 80
12 Asian 4.854777 70
13 Hispanic/Latino 9.685350 70
14 White 56.722293 70
15 Black 15.653503 70
16 Asian 5.678351 60
17 Hispanic/Latino 10.348454 60
18 White 52.373196 60
19 Black 18.307216 60
20 Asian 7.172794 50
21 Hispanic/Latino 10.805147 50
22 White 48.658088 50
23 Black 18.419118 50
24 Asian 9.288732 40
25 Hispanic/Latino 10.422535 40
26 White 45.746479 40
27 Black 18.352113 40
28 Asian 12.136364 30
29 Hispanic/Latino 9.590909 30
30 White 45.863636 30
31 Black 15.621212 30
32 Asian 15.088235 20
33 Hispanic/Latino 9.794118 20
34 White 47.705882 20
35 Black 8.205882 20
36 Asian 17.666667 10
37 Hispanic/Latino 11.333333 10
38 White 42.111111 10
39 Black 6.222222 10
In [34]:
sns.lineplot(data=df_race_ar, y="Percent enrollment", x="Admission Rate",hue="Race")
   # plt.title("Admis ")
plt.show()

Diagnostics (for regression model)¶

https://medium.com/@abhilash.sirigari/a-complete-model-diagnostics-of-multivariate-linear-regression-90aace20ecaf

In [35]:
# investigate multicollinearity
corr = df.corr(numeric_only=True)
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
Out[35]:
<AxesSubplot: >
In [36]:
# based on multicollinearity or deemed irrelevant, remove these columns before running lin reg
cols_to_remove_regression = ["Admissions total","Enrolled total","Tuition and fees, 2010-11","Tuition and fees, 2011-12",
                            "Tuition and fees, 2012-13","Graduation rate - Bachelor degree within 5 years, total",
                             "Graduation rate - Bachelor degree within 6 years, total",
                             "Total price for out-of-state students living on campus 2013-14",
                             "Total price for in-state students living on campus 2013-14",
                            "Percent of undergraduate enrollment that are two or more races",
                            "Percent of undergraduate enrollment that are Race/ethnicity unknown",
                            "Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander",
                            "Admission rate - tens","Name","State abbreviation",
                            'Percent of freshmen submitting SAT scores']
In [37]:
# remove cols
# check back to remove SAT/ACT, longitude/latitude, white/asian pct, applicants?, hbcu vs white pct
regression_cols = []
for col in df.columns:
    if col not in cols_to_remove_regression:
        regression_cols.append(col)
        
df = df[regression_cols]
df.columns
Out[37]:
Index(['Longitude location of institution', 'Latitude location of institution',
       'Region', 'Religious affiliation', 'Applicants total',
       'Tuition and fees, 2013-14', 'Historically Black College or University',
       'Undergraduate enrollment',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Percent of undergraduate enrollment that are Asian',
       'Percent of undergraduate enrollment that are Black or African American',
       'Percent of undergraduate enrollment that are Hispanic/Latino',
       'Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of undergraduate enrollment that are White',
       'Percent of undergraduate enrollment that are Nonresident Alien',
       'Percent of undergraduate enrollment that are women',
       'Graduation rate - Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid',
       'Percent admitted - total', 'Private', 'Degree of urbanization'],
      dtype='object')
In [38]:
# check new multicollinearity after removed cols
corr = df.corr(numeric_only=True)
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
Out[38]:
<AxesSubplot: >
In [39]:
# add dummy variables
names_regions = df["Region"].value_counts().index
for n in names_regions:
    colname = "Region_" + n
    df[colname] = df["Region"] == n
    df[colname] = df[colname].map(int)
    
names_urban = df["Degree of urbanization"].value_counts().index
for n in names_urban:
    colname = "Deg_urban_" + n
    df[colname] = df["Degree of urbanization"] == n
    df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Region"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Region"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Region"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Region"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Degree of urbanization"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Degree of urbanization"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Degree of urbanization"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df["Degree of urbanization"] == n
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/225872125.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].map(int)
In [40]:
# remove original region / degree of urb columns bc we added new dummy cols
df = df.loc[:, ~df.columns.isin(['Region', 'Degree of urbanization'])]
In [41]:
# investigate scatterplots to find linear relationship with admission rate
for col in df.columns:
    sns.scatterplot(data=df, x=col, y="Percent admitted - total")
    plt.show()
    
# longitude isnt very linear
In [42]:
# compare to log transforms - seems like % columns prefer log transforms
for col in df.columns:
    if col == "Percent admitted - total":
        continue
    df_temp = df[[col,"Percent admitted - total"]]
    df_temp["log"] = df_temp[col].apply(np.log10)
    sns.scatterplot(data=df_temp, x="log", y="Percent admitted - total")
    plt.show()
    sns.scatterplot(data=df_temp, x=col, y="Percent admitted - total")
    plt.show()
    print(df_temp.corr())
    print("----------------------------------------------------")
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                   Longitude location of institution  \
Longitude location of institution                           1.000000   
Percent admitted - total                                   -0.060912   
log                                                              NaN   

                                   Percent admitted - total  log  
Longitude location of institution                 -0.060912  NaN  
Percent admitted - total                           1.000000  NaN  
log                                                     NaN  NaN  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                  Latitude location of institution  \
Latitude location of institution                          1.000000   
Percent admitted - total                                  0.140716   
log                                                       0.994661   

                                  Percent admitted - total       log  
Latitude location of institution                  0.140716  0.994661  
Percent admitted - total                          1.000000  0.133161  
log                                               0.133161  1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Religious affiliation  Percent admitted - total  log
Religious affiliation                  1.000000                  0.055164  NaN
Percent admitted - total               0.055164                  1.000000  NaN
log                                         NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Applicants total  Percent admitted - total       log
Applicants total                  1.000000                 -0.348620  0.765698
Percent admitted - total         -0.348620                  1.000000 -0.308649
log                               0.765698                 -0.308649  1.000000
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                           Tuition and fees, 2013-14  \
Tuition and fees, 2013-14                   1.000000   
Percent admitted - total                   -0.247203   
log                                         0.966046   

                           Percent admitted - total       log  
Tuition and fees, 2013-14                 -0.247203  0.966046  
Percent admitted - total                   1.000000 -0.174953  
log                                       -0.174953  1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                          Historically Black College or University  \
Historically Black College or University                                  1.000000   
Percent admitted - total                                                 -0.167663   
log                                                                            NaN   

                                          Percent admitted - total  log  
Historically Black College or University                 -0.167663  NaN  
Percent admitted - total                                  1.000000  NaN  
log                                                            NaN  NaN  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Undergraduate enrollment  Percent admitted - total  \
Undergraduate enrollment                  1.000000                 -0.027950   
Percent admitted - total                 -0.027950                  1.000000   
log                                       0.849547                 -0.011406   

                               log  
Undergraduate enrollment  0.849547  
Percent admitted - total -0.011406  
log                       1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are American Indian or Alaska Native  \
Percent of undergraduate enrollment that are Am...                                           1.000000                               
Percent admitted - total                                                                     0.101109                               
log                                                                                          0.886045                               

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are Am...                  0.101109   
Percent admitted - total                                            1.000000   
log                                                                 0.085188   

                                                         log  
Percent of undergraduate enrollment that are Am...  0.886045  
Percent admitted - total                            0.085188  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are Asian  \
Percent of undergraduate enrollment that are Asian                                           1.000000    
Percent admitted - total                                                                    -0.306807    
log                                                                                          0.865223    

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are Asian                 -0.306807   
Percent admitted - total                                            1.000000   
log                                                                -0.355597   

                                                         log  
Percent of undergraduate enrollment that are Asian  0.865223  
Percent admitted - total                           -0.355597  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are Black or African American  \
Percent of undergraduate enrollment that are Bl...                                           1.000000                        
Percent admitted - total                                                                    -0.159516                        
log                                                                                          0.815842                        

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are Bl...                 -0.159516   
Percent admitted - total                                            1.000000   
log                                                                -0.150307   

                                                         log  
Percent of undergraduate enrollment that are Bl...  0.815842  
Percent admitted - total                           -0.150307  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are Hispanic/Latino  \
Percent of undergraduate enrollment that are Hi...                                           1.000000              
Percent admitted - total                                                                    -0.078636              
log                                                                                          0.829239              

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are Hi...                 -0.078636   
Percent admitted - total                                            1.000000   
log                                                                -0.138014   

                                                         log  
Percent of undergraduate enrollment that are Hi...  0.829239  
Percent admitted - total                           -0.138014  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander  \
Percent of undergraduate enrollment that are Na...                                           1.000000                                        
Percent admitted - total                                                                     0.052389                                        
log                                                                                          0.875619                                        

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are Na...                  0.052389   
Percent admitted - total                                            1.000000   
log                                                                 0.190898   

                                                         log  
Percent of undergraduate enrollment that are Na...  0.875619  
Percent admitted - total                            0.190898  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are White  \
Percent of undergraduate enrollment that are White                                           1.000000    
Percent admitted - total                                                                     0.295258    
log                                                                                          0.882812    

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are White                  0.295258   
Percent admitted - total                                            1.000000   
log                                                                 0.225861   

                                                         log  
Percent of undergraduate enrollment that are White  0.882812  
Percent admitted - total                            0.225861  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are Nonresident Alien  \
Percent of undergraduate enrollment that are No...                                           1.000000                
Percent admitted - total                                                                    -0.256556                
log                                                                                          0.870009                

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are No...                 -0.256556   
Percent admitted - total                                            1.000000   
log                                                                -0.277770   

                                                         log  
Percent of undergraduate enrollment that are No...  0.870009  
Percent admitted - total                           -0.277770  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Percent of undergraduate enrollment that are women  \
Percent of undergraduate enrollment that are women                                           1.000000    
Percent admitted - total                                                                     0.030134    
log                                                                                          0.942661    

                                                    Percent admitted - total  \
Percent of undergraduate enrollment that are women                  0.030134   
Percent admitted - total                                            1.000000   
log                                                                 0.020988   

                                                         log  
Percent of undergraduate enrollment that are women  0.942661  
Percent admitted - total                            0.020988  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                    Graduation rate - Bachelor degree within 4 years, total  \
Graduation rate - Bachelor degree within 4 year...                                           1.000000         
Percent admitted - total                                                                    -0.291469         
log                                                                                          0.922294         

                                                    Percent admitted - total  \
Graduation rate - Bachelor degree within 4 year...                 -0.291469   
Percent admitted - total                                            1.000000   
log                                                                -0.167215   

                                                         log  
Graduation rate - Bachelor degree within 4 year...  0.922294  
Percent admitted - total                           -0.167215  
log                                                 1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                                                 Percent of freshmen receiving any financial aid  \
Percent of freshmen receiving any financial aid                                         1.000000   
Percent admitted - total                                                                0.348078   
log                                                                                     0.994311   

                                                 Percent admitted - total  \
Percent of freshmen receiving any financial aid                  0.348078   
Percent admitted - total                                         1.000000   
log                                                              0.367654   

                                                      log  
Percent of freshmen receiving any financial aid  0.994311  
Percent admitted - total                         0.367654  
log                                              1.000000  
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Private  Percent admitted - total  log
Private                   1.00000                  -0.13153  NaN
Percent admitted - total -0.13153                   1.00000  NaN
log                           NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Region_South  Percent admitted - total  log
Region_South                  1.000000                 -0.096833  NaN
Percent admitted - total     -0.096833                  1.000000  NaN
log                                NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Region_Northeast  Percent admitted - total  log
Region_Northeast                  1.000000                 -0.033416  NaN
Percent admitted - total         -0.033416                  1.000000  NaN
log                                    NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Region_Midwest  Percent admitted - total  log
Region_Midwest                  1.000000                  0.127834  NaN
Percent admitted - total        0.127834                  1.000000  NaN
log                                  NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Region_West  Percent admitted - total  log
Region_West                  1.000000                  0.014902  NaN
Percent admitted - total     0.014902                  1.000000  NaN
log                               NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Deg_urban_City  Percent admitted - total  log
Deg_urban_City                  1.000000                 -0.066668  NaN
Percent admitted - total       -0.066668                  1.000000  NaN
log                                  NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Deg_urban_Suburb  Percent admitted - total  log
Deg_urban_Suburb                  1.000000                 -0.007049  NaN
Percent admitted - total         -0.007049                  1.000000  NaN
log                                    NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Deg_urban_Town  Percent admitted - total  log
Deg_urban_Town                  1.000000                  0.057271  NaN
Percent admitted - total        0.057271                  1.000000  NaN
log                                  NaN                       NaN  NaN
----------------------------------------------------
/var/folders/g4/86zp2jms1qxc7lry7pqtg1c80000gn/T/ipykernel_81988/4278837565.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp["log"] = df_temp[col].apply(np.log10)
                          Deg_urban_Rural  Percent admitted - total  log
Deg_urban_Rural                  1.000000                  0.050127  NaN
Percent admitted - total         0.050127                  1.000000  NaN
log                                   NaN                       NaN  NaN
----------------------------------------------------

Regression (never used model)¶

In [43]:
x = df[['Longitude location of institution', 'Latitude location of institution',
       'Religious affiliation', 'Applicants total',
       'Tuition and fees, 2013-14', 'Historically Black College or University',
       'Undergraduate enrollment',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Percent of undergraduate enrollment that are Asian',
       'Percent of undergraduate enrollment that are Black or African American',
       'Percent of undergraduate enrollment that are Hispanic/Latino',
       'Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of undergraduate enrollment that are White',
       'Percent of undergraduate enrollment that are Nonresident Alien',
       'Percent of undergraduate enrollment that are women',
       'Graduation rate - Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid',
       'Private', 'Region_South',
       'Region_Northeast', 'Region_Midwest', 'Region_West', 'Deg_urban_City',
       'Deg_urban_Suburb', 'Deg_urban_Town', 'Deg_urban_Rural']]
y = df['Percent admitted - total']
 
In [44]:
import statsmodels.api as sm

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#view model summary
print(model.summary())
                               OLS Regression Results                               
====================================================================================
Dep. Variable:     Percent admitted - total   R-squared:                       0.394
Model:                                  OLS   Adj. R-squared:                  0.383
Method:                       Least Squares   F-statistic:                     36.62
Date:                      Mon, 20 Feb 2023   Prob (F-statistic):          1.28e-128
Time:                              21:30:50   Log-Likelihood:                -5637.6
No. Observations:                      1376   AIC:                         1.133e+04
Df Residuals:                          1351   BIC:                         1.146e+04
Df Model:                                24                                         
Covariance Type:                  nonrobust                                         
==========================================================================================================================================================
                                                                                             coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------------------------------------------------
const                                                                                     19.3632      8.030      2.411      0.016       3.611      35.115
Longitude location of institution                                                         -0.0083      0.076     -0.110      0.912      -0.157       0.140
Latitude location of institution                                                           0.1879      0.141      1.333      0.183      -0.089       0.464
Religious affiliation                                                                      1.5262      1.185      1.288      0.198      -0.799       3.851
Applicants total                                                                          -0.0009   8.74e-05    -10.632      0.000      -0.001      -0.001
Tuition and fees, 2013-14                                                              -3.516e-05   8.46e-05     -0.415      0.678      -0.000       0.000
Historically Black College or University                                                  -1.7527      3.788     -0.463      0.644      -9.183       5.678
Undergraduate enrollment                                                                   0.0009      0.000      8.507      0.000       0.001       0.001
Percent of undergraduate enrollment that are American Indian or Alaska Native              0.3109      0.257      1.210      0.226      -0.193       0.815
Percent of undergraduate enrollment that are Asian                                        -0.3060      0.120     -2.555      0.011      -0.541      -0.071
Percent of undergraduate enrollment that are Black or African American                    -0.1616      0.074     -2.175      0.030      -0.307      -0.016
Percent of undergraduate enrollment that are Hispanic/Latino                              -0.0828      0.074     -1.126      0.260      -0.227       0.061
Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander     0.7613      0.371      2.052      0.040       0.033       1.489
Percent of undergraduate enrollment that are White                                         0.0639      0.063      1.009      0.313      -0.060       0.188
Percent of undergraduate enrollment that are Nonresident Alien                            -0.2868      0.120     -2.388      0.017      -0.522      -0.051
Percent of undergraduate enrollment that are women                                         0.0095      0.036      0.266      0.790      -0.061       0.080
Graduation rate - Bachelor degree within 4 years, total                                   -0.0969      0.033     -2.922      0.004      -0.162      -0.032
Percent of freshmen receiving any financial aid                                            0.3937      0.047      8.441      0.000       0.302       0.485
Private                                                                                   -4.3871      2.046     -2.144      0.032      -8.401      -0.374
Region_South                                                                               1.6270      1.700      0.957      0.339      -1.708       4.961
Region_Northeast                                                                           6.1729      1.881      3.282      0.001       2.483       9.863
Region_Midwest                                                                             3.5989      2.275      1.582      0.114      -0.864       8.062
Region_West                                                                                7.9644      3.607      2.208      0.027       0.889      15.040
Deg_urban_City                                                                             6.1011      2.167      2.815      0.005       1.850      10.352
Deg_urban_Suburb                                                                           5.7776      2.106      2.744      0.006       1.647       9.908
Deg_urban_Town                                                                             2.8391      2.223      1.277      0.202      -1.521       7.199
Deg_urban_Rural                                                                            4.6453      2.345      1.981      0.048       0.045       9.246
==============================================================================
Omnibus:                        5.448   Durbin-Watson:                   1.912
Prob(Omnibus):                  0.066   Jarque-Bera (JB):                6.355
Skew:                          -0.054   Prob(JB):                       0.0417
Kurtosis:                       3.315   Cond. No.                     1.12e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.43e-21. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [45]:
# check residual plots for each predictor
# applicants total, undergrad enrollment, financial aid vary w residuals - shouldn't use linear regression
for col in df.columns:
    if col == "Percent admitted - total":
        continue
    fig = plt.figure(figsize=(12,8))
    fig = sm.graphics.plot_regress_exog(model, col, fig=fig)
    plt.show()
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1
eval_env: 1